Abstract

In this paper we study the performance of the Projected Gradient Descent (PGD) algorithm for
$\ell_p$-constrained least squares problems that arise in the framework of Compressed Sensing. Relying on the
Restricted Isometry Property, we provide convergence guarantees for this algorithm for the entire range of
$0 \le p \le 1$ that include and generalize the existing results for the Iterative Hard Thresholding algorithm and
provide a new accuracy guarantee for the Iterative Soft Thresholding algorithm as special cases. Our results
suggest that in this group of algorithms, as $p$ increases from zero to one, conditions required to guarantee
accuracy become stricter and robustness to noise deteriorates.
1. Introduction

Least squares problems occur in various signal processing and statistical inference applications. In these
problems the relation between the vector of noisy observations $y \in \mathbb{C}^m$ and the unknown parameter or signal
$x^\star \in \mathbb{C}^n$ is governed by a linear equation of the form

$$y = A x^\star + e, \tag{1}$$

where $A \in \mathbb{C}^{m \times n}$ is a matrix that may model a linear system or simply contain a set of collected data. The
vector $e \in \mathbb{C}^m$ represents the additive observation noise. Estimating $x^\star$ from the observation vector $y$ is
achieved by finding the $x \in \mathbb{C}^n$ that minimizes the squared error $\|Ax - y\|_2^2$. This least squares approach,
however, is well-posed only if the null space of the matrix $A$ contains merely the zero vector. The cases in which
the null space is larger than the singleton $\{0\}$, as in underdetermined scenarios ($m < n$), are more relevant
in a variety of applications. To enforce unique least squares solutions in these cases, it becomes necessary
to have some prior information about the structure of $x^\star$.

One of the structural characteristics that describes parameters and signals of interest in a wide range of
applications from medical imaging to astronomy is sparsity. Since the advent of the theory of compressed
sensing, development and analysis of algorithms that exploit sparsity for estimation in underdetermined
problems have become important topics of study. In the absence of noise $x^\star$ can be uniquely determined
from the observation vector $y = A x^\star$, provided that $\mathrm{spark}(A) > 2\|x^\star\|_0$ (i.e., every $2\|x^\star\|_0$ columns of $A$
are linearly independent) [12]. Then the ideal estimation procedure could simply be finding the sparsest
vector $x$ that incurs no residual error (i.e., $\|Ax - y\|_2 = 0$). This ideal estimation method can be extended
to the case of noisy observations as well. Formally, given an upper bound $\varepsilon$ on the $\ell_2$-norm of the noise, the
vector $x^\star$ can be estimated by solving the $\ell_0$-minimization

$$\arg\min_x \|x\|_0 \quad \text{s.t.} \quad \|Ax - y\|_2 \le \varepsilon, \tag{2}$$

where $\|x\|_0$ denotes the $\ell_0$-norm¹ of the vector $x$ that merely counts the number of its non-zero entries.
However, this minimization problem is in general NP-hard [17]. To avoid the combinatorial computational
cost of (2), often the $\ell_0$-norm is substituted by the $\ell_p$-norm $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ for some $p \in (0, 1]$,
providing the $\ell_p$-minimization

$$\arg\min_x \|x\|_p \quad \text{s.t.} \quad \|Ax - y\|_2 \le \varepsilon. \tag{3}$$

In particular, at $p = 1$ the $\ell_1$-minimization can be solved in polynomial time using convex programming
algorithms. Several theoretical and experimental results [see e.g., 7, 20, 21] suggest that $\ell_p$-minimization
with $p \in (0, 1)$ requires fewer observations than the $\ell_1$-minimization to produce accurate estimates. However,
$\ell_p$-minimization is a non-convex problem where finding the global minimizer is not guaranteed and can be
computationally more expensive than the $\ell_1$-minimization.

An alternative approach in the framework of sparse linear regression is to solve the sparsity-constrained
least squares problem

$$\arg\min_x \frac{1}{2}\|Ax - y\|_2^2 \quad \text{s.t.} \quad \|x\|_0 \le s, \tag{4}$$

where $s = \|x^\star\|_0$ is given. Similar to (2), solving (4) is not tractable and approximate solvers must be
sought. Several compressed sensing algorithms jointly known as the greedy pursuits, including Iterative
Hard Thresholding (IHT) [3], Subspace Pursuit (SP) [10], and Compressive Sampling Matching Pursuit
(CoSaMP) [18], are implicitly approximate solvers of (4).

As a relaxation of (4) one may also consider the $\ell_p$-constrained least squares

$$\arg\min_x \frac{1}{2}\|Ax - y\|_2^2 \quad \text{s.t.} \quad \|x\|_p \le R^\star, \tag{5}$$

given $R^\star = \|x^\star\|_p$. The Least Absolute Shrinkage and Selection Operator (LASSO) [22] is a well-known
special case of this optimization problem with $p = 1$. The optimization problem of (5) typically does not have
a closed-form solution, but can be (approximately) solved using iterative Projected Gradient Descent (PGD),
which is outlined in Section 2. Previous studies of these algorithms, henceforth referred to as $\ell_p$-PGD,
are limited to the cases of $p = 0$ and $p = 1$. The algorithm corresponding to the case of $p = 0$ is recognized
in the literature as the IHT algorithm. The Iterative Soft Thresholding (IST) algorithm [2] was originally
proposed as a solver of the Basis Pursuit Denoising (BPDN) [9], which is the unconstrained equivalent of
the LASSO with the $\ell_1$-norm as the regularization term. However, the IST algorithm also naturally describes
a PGD solver of (5) for $p = 1$ [see e.g., 1] by considering varying shrinkage in iterations, as described
in [2], to enforce the iterates to have sufficiently small $\ell_1$-norm. The main contribution of this paper is a
comprehensive analysis of the performance of $\ell_p$-PGD algorithms for the entire regime of $p \in [0, 1]$.

In the extreme case of $p = 0$ we have the $\ell_0$-PGD algorithm which is indeed the IHT algorithm. Unlike
conventional PGD algorithms, the feasible set — the set of points that satisfy the optimization constraints — 
for IHT is the non-convex set of s-sparse vectors. Therefore, the standard analysis for PGD algorithms with 
convex feasible sets that relies on the fact that projection onto convex sets defines a contraction map will no 
longer apply. However, imposing extra conditions on the matrix A can be leveraged to provide convergence 
guarantees [3, 13]. 

At $p = 1$ where (5) is a convex program, the corresponding $\ell_1$-PGD algorithm has been studied under
the name of IST in different scenarios (see [2] and references therein). Ignoring the sparsity of the vector $x^\star$,
it can be shown that the IST algorithm exhibits a sublinear rate of convergence as a convex optimization
algorithm [2]. In the context of the sparse estimation problems, however, faster rates of convergence can
be guaranteed for IST. For example, in [1] PGD algorithms are studied in a broad category of regression
problems regularized with "decomposable" norms. In this configuration, which includes sparse linear regression
via IST, the PGD algorithms are shown to possess a linear rate of convergence provided the objective
function (the squared error in our case) satisfies Restricted Strong Convexity (RSC) and Restricted Smoothness
(RSM) conditions [1]. These two conditions basically control the curvature of the objective function
being restricted to (nearly) sparse vectors. Although the results provided in [1] consolidate the analysis of
several interesting problems, they do not readily extend to the case of $\ell_p$-constrained least squares since the
constraint is not defined by a true norm.

¹ The term "norm" is used for convenience throughout the paper. In fact, the $\ell_0$ functional violates the positive scalability
property of the norms and the $\ell_p$ functionals with $p \in (0, 1)$ are merely quasi-norms.

In this paper, by considering $\ell_p$-balls of given radii as feasible sets in the general case, we study the
$\ell_p$-PGD algorithms that render a continuum of sparse reconstruction algorithms, and encompass both the
IHT and the IST algorithms. In Section 2, using the Restricted Isometry Property (RIP) [5], we provide
accuracy guarantees for $\ell_p$-PGD algorithms which assert that these algorithms converge to the true signal
up to a multiple of the noise level at a linear rate. Furthermore, our results suggest that as $p$ increases
from zero to one the convergence and robustness to noise deteriorate. This conclusion is particularly in
agreement with the empirical studies of the phase transition of the IST and IHT algorithms provided in
[16]. Our results for $\ell_0$-PGD coincide with the guarantees for IHT derived in [13]. Furthermore, to the best
of our knowledge the RIP-based accuracy guarantees we provide for IST, which is the $\ell_1$-PGD algorithm,
have not been derived before. The last section of the paper, Section 3, is dedicated to discussion of some
details and future work.

Notation. Throughout the paper we assume that the vectors and matrices have complex entries unless stated
otherwise. The set $\{1, 2, \ldots, n\}$ is denoted by $[n]$ for brevity. We use $M_J$ to denote restriction of the matrix
$M$ to the columns selected by the set of indices $J \subseteq [n]$. Similarly, $v|_J$ denotes restriction of the vector $v$ to
the entries with indices in $J$. Depending on the context, the vector $v|_J$ may also denote a vector that is equal
to the vector $v$ except for the part supported on $J^c$ where it is zero. The set of non-zero entries (i.e., the
support set) and the best $s$-term approximation of vector $v$ are denoted by $\mathrm{supp}(v)$ and $v_s$, respectively.
Furthermore, the matrix $M^H$ denotes the Hermitian conjugate of the matrix $M$. The inner product of
vectors $u$ and $v$ is denoted by $\langle u, v \rangle$. Finally, $\Re[\cdot]$ and $\mathrm{Arg}(\cdot)$ denote the real part and the phase of their
arguments, respectively.

2. Projected Gradient Descent for $\ell_p$-constrained Least Squares

One of the most elementary tools in convex optimization for constrained minimization is the PGD
method. For a differentiable convex objective function $f(\cdot)$, a convex set $Q$, and a projection operator $P_Q(\cdot)$
defined by

$$P_Q(x) = \arg\min_u \|x - u\|_2 \quad \text{s.t.} \quad u \in Q, \tag{6}$$

the PGD algorithm solves the minimization

$$\arg\min_x f(x) \quad \text{s.t.} \quad x \in Q$$

via the iterations outlined in Algorithm 1.

Algorithm 1: Projected Gradient Descent
  input : Objective function $f(\cdot)$ and an operator $P_Q(\cdot)$ that performs projection onto the set $Q$
  Choose the initial point $x^0 \in Q$
  $k \leftarrow 0$
  repeat
      Choose a step-size $\eta_k > 0$
      $x^{k+1} \leftarrow P_Q\left(x^k - \eta_k \nabla f(x^k)\right)$
      $k \leftarrow k + 1$
  until halting condition holds
  output: the (approximate) minimizer $x^k$

For example, in a broad range of applications where the objective function is the squared error of the form
$f(x) = \frac{1}{2}\|Ax - y\|_2^2$, the iterate update equation of the PGD method in Algorithm 1 reduces to

$$x^{k+1} = P_Q\left(x^k - \eta_k A^H (A x^k - y)\right). \tag{7}$$

In the context of compressed sensing, if (1) holds and $Q$ is the $\ell_1$-ball of radius $\|x^\star\|_1$ centered at the
origin, Algorithm 1 reduces to the IST algorithm (except perhaps for variable step-size) that solves (5)
for $p = 1$. By relaxing the convexity restriction imposed on $Q$, the PGD iterations also describe the IHT
algorithm where $Q$ is the set of vectors whose $\ell_0$-norm is not greater than $s = \|x^\star\|_0$.
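
The update (7) with $Q$ taken as the set of $s$-sparse vectors can be sketched as follows. This is only an illustrative sketch and not code from the paper: the function names `project_l0` and `pgd_least_squares`, the fixed step-size `eta = 1.0`, the zero initial point, and the small random test problem are all assumptions made for the example.

```python
import numpy as np

def project_l0(x, s):
    """Keep the s largest-magnitude entries of x and zero the rest (hard thresholding)."""
    out = np.zeros_like(x)
    if s > 0:
        keep = np.argsort(np.abs(x))[-s:]
        out[keep] = x[keep]
    return out

def pgd_least_squares(A, y, project, eta, n_iter=100):
    """Iterate x <- P(x - eta * A^H (A x - y)), starting from x = 0 (an assumed initial point)."""
    x = np.zeros(A.shape[1], dtype=np.result_type(A, y))
    for _ in range(n_iter):
        grad = A.conj().T @ (A @ x - y)   # gradient of 0.5 * ||A x - y||_2^2
        x = project(x - eta * grad)
    return x

# Hypothetical noiseless test problem: Gaussian A, 3-sparse signal with +-1 entries.
rng = np.random.default_rng(0)
m, n, s = 200, 256, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.choice([-1.0, 1.0], size=s)
y = A @ x_true

x_hat = pgd_least_squares(A, y, lambda v: project_l0(v, s), eta=1.0)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Passing a different `project` callable (for instance an $\ell_1$-ball projection) turns the same loop into the $p = 1$ variant; the loop itself does not depend on $p$.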

Henceforth, we refer to an $\ell_p$-ball centered at the origin and aligned with the axes simply as an $\ell_p$-ball
for brevity. To proceed let us define the set

$$\mathcal{F}_p(c) = \left\{ x \in \mathbb{C}^n \;\middle|\; \sum_{i=1}^n |x_i|^p \le c \right\}, \tag{8}$$

for $c \in \mathbb{R}_+$, which describes an $\ell_p$-ball. Although $c$ can be considered as the radius of this $\ell_p$-ball with
respect to the metric $d(a, b) = \|a - b\|_p^p$, we call $c$ the "$p$-radius" of the $\ell_p$-ball to avoid confusion with
the conventional definition of the radius for an $\ell_p$-ball, i.e., $\max_{x \in \mathcal{F}_p(c)} \|x\|_p$. Furthermore, at $p = 0$ where
$\mathcal{F}_p(c)$ describes the same "$\ell_0$-ball" for different values of $c$, we choose the smallest $c$ as the $p$-radius of the
$\ell_p$-ball for uniqueness. In this section we will show that to estimate the signal $x^\star$ that is either sparse or
compressible, the PGD method can in fact be applied in a more general framework where the feasible set is
considered to be an $\ell_p$-ball of given $p$-radius. Ideally the $p$-radius of the feasible set should be $\|x^\star\|_p^p$, but
in practice this information might not be available. In our analysis, we merely assume that the $p$-radius of
the feasible set is not greater than $\|x^\star\|_p^p$, i.e., the feasible set does not contain $x^\star$ in its interior.

Note that for the feasible sets $Q = \mathcal{F}_p(c)$ with $p \in (0, 1]$ the minimum value in (6) is always attained
because the objective is continuous and the set $Q$ is compact. Therefore, there is at least one minimizer in
$Q$. However, for $p < 1$ the set $Q$ is nonconvex and there might be multiple projection points in general. For
the purpose of the analysis presented in this paper, however, any such minimizer is acceptable. Using the
axiom of choice, we can assume existence of a choice function that for every $x$ selects one of the solutions of
(6). This function indeed determines a projection operator which we denote by $P_Q(x)$.
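
For the convex case $p = 1$ the projection in (6) is unique and can be computed exactly. The sketch below uses the standard sort-based procedure for projecting onto an $\ell_1$-ball, which is not described in this paper and is brought in here only as an illustration; the function name `project_l1_ball` and the example vector are assumptions. Complex entries are handled by shrinking magnitudes while keeping phases.

```python
import numpy as np

def project_l1_ball(x, c):
    """Euclidean projection of a real or complex vector x onto {u : ||u||_1 <= c}, with c > 0."""
    mag = np.abs(x)
    if mag.sum() <= c:
        return x.copy()                      # x is already feasible
    # Find the shrinkage level lam such that sum(max(mag - lam, 0)) == c.
    srt = np.sort(mag)[::-1]
    csum = np.cumsum(srt)
    ks = np.arange(1, mag.size + 1)
    rho = np.nonzero(srt * ks > csum - c)[0][-1]
    lam = (csum[rho] - c) / (rho + 1)
    shrunk = np.maximum(mag - lam, 0.0)
    # Rescale magnitudes; the phase (or sign) of every entry is left unchanged.
    scale = np.divide(shrunk, mag, out=np.zeros_like(mag), where=mag > 0)
    return x * scale

x = np.array([3.0, -1.0, 0.5, 2.0j])
u = project_l1_ball(x, 2.0)   # magnitudes (3, 1, 0.5, 2) are shrunk by lam = 1.5
```

In this example the surviving entries both lose the same amount of magnitude (1.5), which is the familiar soft-thresholding behaviour at $p = 1$. No comparably simple routine is claimed here for $p \in (0, 1)$.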

Many compressed sensing algorithms such as those of [3, 4, 10, 18] rely on sufficient conditions expressed
in terms of the RIP of the matrix $A$. We also provide accuracy guarantees of the $\ell_p$-PGD algorithm with
the assumption that certain RIP conditions hold. The following definition states the RIP in its asymmetric
form. This definition was previously proposed in the literature [14], though in a slightly different format.

Definition (RIP). Matrix $A$ is said to have RIP of order $s$ with restricted isometry constants $\alpha_s$ and $\beta_s$ if
they are, in order, the smallest and the largest non-negative numbers such that

$$\beta_s \|x\|_2^2 \le \|Ax\|_2^2 \le \alpha_s \|x\|_2^2$$

holds for all $s$-sparse vectors $x$.
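
For very small matrices the constants in this definition can be computed by brute force, since they are the extreme eigenvalues of $A_J^H A_J$ over all index sets $J$ of size $s$. The sketch below is only an illustration under that observation; the function name `asymmetric_rip_constants` and the random test matrix are assumptions, and the enumeration is exponential in $s$, so it is not a practical tool.

```python
import itertools
import numpy as np

def asymmetric_rip_constants(A, s):
    """Return (alpha_s, beta_s) as the extreme eigenvalues of A_J^H A_J over all |J| = s."""
    n = A.shape[1]
    alpha, beta = 0.0, np.inf
    for J in itertools.combinations(range(n), s):
        AJ = A[:, list(J)]
        eig = np.linalg.eigvalsh(AJ.conj().T @ AJ)   # eigenvalues in ascending order
        alpha = max(alpha, eig[-1])
        beta = min(beta, eig[0])
    return alpha, beta

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 12)) / np.sqrt(10)      # hypothetical small matrix
alpha, beta = asymmetric_rip_constants(A, 2)
rho = (alpha - beta) / (alpha + beta)                # ratio playing the role of delta_s
```

Enumerating only sets of size exactly $s$ suffices because every vector with fewer than $s$ non-zeros is supported inside some such set.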

In the literature usually the symmetric form of the RIP is considered in which $\alpha_s = 1 + \delta_s$ and $\beta_s = 1 - \delta_s$
with $\delta_s \in [0, 1]$. For example, in [13] the $\ell_1$-minimization is shown to accurately estimate $x^\star$ provided
$\delta_{2s} < 3/(4 + \sqrt{6}) \approx 0.46515$. Similarly, accuracy of the estimates obtained by IHT, SP, and CoSaMP is
guaranteed provided $\delta_{3s} < 1/2$ [13], $\delta_{3s} < 0.205$ [10], and $\delta_{4s} < \sqrt{2/(5 + \sqrt{73})} \approx 0.38427$ [13], respectively.

As our first contribution, in the following theorem we show that the $\ell_p$-PGD accurately solves
$\ell_p$-constrained least squares provided the matrix $A$ satisfies a proper RIP criterion. To proceed we define

$$\rho_s = \frac{\alpha_s - \beta_s}{\alpha_s + \beta_s},$$

which can be interpreted as the equivalent of the standard RIP constant $\delta_s$ in the asymmetric form of RIP.

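Theorem 2.1. Let $x^\star$ be an $s$-sparse vector whose compressive measurements are observed according to (1)
using a measurement matrix $A$ that satisfies RIP of order $3s$. To estimate $x^\star$ via the $\ell_p$-PGD algorithm
an $\ell_p$-ball $\mathcal{B}$ with $p$-radius $c$ (i.e., $\mathcal{B} = \mathcal{F}_p(c)$) is given as the feasible set for the algorithm such that
$c = (1 - \varepsilon)^p \|x^\star\|_p^p$ for some² $\varepsilon \in [0, 1)$. Furthermore, suppose that the step-size $\eta_k$ of the algorithm can be
chosen to obey

$$\left| \frac{\eta_k (\alpha_{3s} + \beta_{3s})}{2} - 1 \right| \le \tau \quad \text{for some } \tau \ge 0.$$

If

$$(1 + \tau)\rho_{3s} + \tau < \frac{1}{2\left(1 + \sqrt{2}\,\xi(p)\right)^2}, \tag{9}$$

with $\xi(p)$ denoting the function $\sqrt{p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}$, then $x^k$, the $k$-th iterate of the algorithm, obeys

$$\|x^k - x^\star\|_2 \le (2\gamma)^k \|x^\star\|_2 + \frac{2(1+\tau)}{1 - 2\gamma}\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star\|_2 + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|e\|_2\right) + \varepsilon\|x^\star\|_2, \tag{10}$$

where

$$\gamma = \left((1 + \tau)\rho_{3s} + \tau\right)\left(1 + \sqrt{2}\,\xi(p)\right)^2. \tag{11}$$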

Remark 2.1. Note that the parameter $\varepsilon$ indicates how well the feasible set $\mathcal{B}$ approximates the ideal feasible
set $\mathcal{B}^\star = \mathcal{F}_p\left(\|x^\star\|_p^p\right)$. The terms in (10) that depend on $\varepsilon$ determine the error caused by the mismatch
between $\mathcal{B}$ and $\mathcal{B}^\star$. Ideally, one has $\varepsilon = 0$ and the residual error becomes merely dependent on the noise
level $\|e\|_2$.

Remark 2.2. The parameter $\tau$ determines the deviation of the step-size $\eta_k$ from $\frac{2}{\alpha_{3s} + \beta_{3s}}$, which might not be
known a priori. In this formulation, smaller values of $\tau$ are desirable since they impose a less restrictive condition
on $\rho_{3s}$ and also result in smaller residual error. Furthermore, we can naively choose $\eta_k = \|x\|_2^2 / \|Ax\|_2^2$
for some $3s$-sparse vector $x \ne 0$ to ensure $1/\alpha_{3s} \le \eta_k \le 1/\beta_{3s}$ and thus

$$\left| \frac{\eta_k(\alpha_{3s} + \beta_{3s})}{2} - 1 \right| \le \frac{\alpha_{3s} - \beta_{3s}}{2\beta_{3s}}.$$

Therefore, we can always assume that $\tau \le \frac{\alpha_{3s} - \beta_{3s}}{2\beta_{3s}}$.

Remark 2.3. Note that the function $\xi(p)$, depicted in Fig. 1, controls the variation of the stringency of the
condition (9) and the variation of the residual error in (10) in terms of $p$. Straightforward algebra shows
that $\xi(p)$ is an increasing function of $p$ with $\xi(0) = 0$. Therefore, as $p$ increases from zero to one, the RHS
of (9) decreases, which implies the measurement matrix must have a smaller $\rho_{3s}$ to satisfy the sufficient
condition (9). Similarly, as $p$ increases from zero to one the residual error in (10) increases. To contrast
this result with the existing guarantees of other iterative algorithms, suppose that $\tau = 0$, $\varepsilon = 0$, and we
use the symmetric form of RIP (i.e., $\alpha_{3s} = 1 + \delta_{3s}$ and $\beta_{3s} = 1 - \delta_{3s}$) which implies $\rho_{3s} = \delta_{3s}$. At $p = 0$,
corresponding to the IHT algorithm, (9) reduces to $\delta_{3s} < 1/2$ that is identical to the condition derived in
[13]. Furthermore, the required condition at $p = 1$, corresponding to the IST algorithm, would be $\delta_{3s} < 1/8$.
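
The two endpoint conditions quoted in this remark can be reproduced numerically from the definition of $\xi(p)$ and the right-hand side of (9). The following is a small illustrative sketch; the function names `xi` and `rip_threshold` and the evaluation grid are assumptions, and $\xi(0)$ is taken as the limit value $0$.

```python
import numpy as np

def xi(p):
    """xi(p) = sqrt(p) * (2 / (2 - p)) ** (1/2 - 1/p), with xi(0) set to its limit 0."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    out = np.zeros_like(p)
    pos = p > 0
    out[pos] = np.sqrt(p[pos]) * (2.0 / (2.0 - p[pos])) ** (0.5 - 1.0 / p[pos])
    return out

def rip_threshold(p):
    """Right-hand side of condition (9): 1 / (2 * (1 + sqrt(2) * xi(p)) ** 2)."""
    return 1.0 / (2.0 * (1.0 + np.sqrt(2.0) * xi(p)) ** 2)

ps = np.linspace(0.0, 1.0, 11)
thresholds = rip_threshold(ps)   # thresholds[0] is the p = 0 bound, thresholds[-1] the p = 1 bound
```

On this grid the threshold falls from $1/2$ at $p = 0$ to $1/8$ at $p = 1$, in line with the monotonicity described above.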

The guarantees stated in Theorem 2.1 can be generalized for nearly sparse or compressible signals that 
can be defined using power laws as described in [6]. The following corollary provides error bounds for a 
general choice of x* . 

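Corollary 2.1. Suppose that $x^\star$ is an arbitrary vector in $\mathbb{C}^n$ and the conditions of Theorem 2.1 hold for
$x^\star_s$, then the $k$-th iterate of the $\ell_p$-PGD algorithm provides an estimate of $x^\star_s$ that obeys

$$\|x^k - x^\star\|_2 \le (2\gamma)^k \|x^\star_s\|_2 + \frac{2(1+\tau)}{1 - 2\gamma}\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star_s\|_2 + \frac{2\alpha_{2s}}{\alpha_{3s} + \beta_{3s}}\left(\|x^\star - x^\star_s\|_2 + \frac{\|x^\star - x^\star_s\|_1}{\sqrt{2s}}\right) + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|e\|_2\right) + \varepsilon\|x^\star_s\|_2 + \|x^\star - x^\star_s\|_2.$$

² At $p = 0$ we have $(1 - \varepsilon)^0 = 1$ which enforces $c = \|x^\star\|_0$. In this case $\varepsilon$ is not unique, but to make a coherent statement
we assume that $\varepsilon = 0$.

Figure 1: Plot of the function $\xi(p) = \sqrt{p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}$, which determines the contraction factor and the residual error.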



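Proof. Let $\tilde{e} = A(x^\star - x^\star_s) + e$. We can write $y = A x^\star + e = A x^\star_s + \tilde{e}$. Thus, we can apply Theorem 2.1
considering $x^\star_s$ as the signal of interest and $\tilde{e}$ as the noise vector and obtain

$$\|x^k - x^\star_s\|_2 \le (2\gamma)^k \|x^\star_s\|_2 + \frac{2(1+\tau)}{1 - 2\gamma}\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star_s\|_2 + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|\tilde{e}\|_2\right) + \varepsilon\|x^\star_s\|_2. \tag{12}$$

Furthermore, we have

$$\|\tilde{e}\|_2 = \|A(x^\star - x^\star_s) + e\|_2 \le \|A(x^\star - x^\star_s)\|_2 + \|e\|_2.$$

Then applying Proposition 3.5 of [18] yields

$$\|\tilde{e}\|_2 \le \sqrt{\alpha_{2s}}\left(\|x^\star - x^\star_s\|_2 + \frac{1}{\sqrt{2s}}\|x^\star - x^\star_s\|_1\right) + \|e\|_2.$$

Applying this inequality in (12) followed by the triangle inequality $\|x^k - x^\star\|_2 \le \|x^k - x^\star_s\|_2 + \|x^\star - x^\star_s\|_2$
yields the desired inequality. ■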

To prove Theorem 2.1 first a series of lemmas should be established. In what follows, $x^\perp$ is a projection
of the $s$-sparse vector $x^\star$ onto $\mathcal{B}$ and $x^\star - x^\perp$ is denoted by $d^\star$. Furthermore, for $k = 0, 1, 2, \ldots$ we denote
$x^k - x^\perp$ by $d^k$ for compactness.



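Lemma 2.1. If $x^k$ denotes the estimate in the $k$-th iteration of $\ell_p$-PGD, then

$$\|d^{k+1}\|_2^2 \le 2\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k \langle A d^k, A d^{k+1}\rangle\right] + 2\eta_k \Re\langle A d^{k+1}, A d^\star + e\rangle.$$

Proof. Note that $x^{k+1}$ is a projection of $x^k - \eta_k A^H(A x^k - y)$ onto $\mathcal{B}$. Since $x^\perp$ is also a feasible point (i.e.,
$x^\perp \in \mathcal{B}$) we have

$$\left\|x^{k+1} - x^k + \eta_k A^H(A x^k - y)\right\|_2^2 \le \left\|x^\perp - x^k + \eta_k A^H(A x^k - y)\right\|_2^2.$$

Using (1) we obtain

$$\left\|d^{k+1} - d^k + \eta_k A^H\left(A(d^k - d^\star) - e\right)\right\|_2^2 \le \left\|-d^k + \eta_k A^H\left(A(d^k - d^\star) - e\right)\right\|_2^2.$$

Therefore, we obtain

$$\Re\left\langle d^{k+1},\; d^{k+1} - 2d^k + 2\eta_k A^H\left(A d^k - (A d^\star + e)\right)\right\rangle \le 0$$

that yields the desired result after straightforward algebraic manipulations. ■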

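The following lemma is a special case of the generalized shifting inequality proposed in [13, Theorem 2].
Please refer to the reference for the proof.

Lemma 2.2 (Shifting Inequality [13]). If $0 < p < 2$ and

$$u_1 \ge u_2 \ge \cdots \ge u_l \ge u_{l+1} \ge \cdots \ge u_r \ge u_{r+1} \ge \cdots \ge u_{r+l} \ge 0,$$

then for $C = \max\left\{ r^{1/2 - 1/p},\; \sqrt{\tfrac{p}{2}}\left(\tfrac{2}{2-p}\, l\right)^{1/2 - 1/p} \right\}$,

$$\left(\sum_{i=l+1}^{l+r} u_i^2\right)^{1/2} \le C \left(\sum_{i=1}^{r} u_i^p\right)^{1/p}. \tag{13}$$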

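Lemma 2.3. For $x^\perp$, a projection of $x^\star$ onto $\mathcal{B}$, we have $\mathrm{supp}(x^\perp) \subseteq S := \mathrm{supp}(x^\star)$.

Proof. Proof is by contradiction. Suppose that there exists a coordinate $i$ such that $x^\star_i = 0$ but $x^\perp_i \ne 0$.
Then one can construct vector $x'$ which is equal to $x^\perp$ except at the $i$-th coordinate where it is zero.
Obviously $x'$ is feasible because $\|x'\|_p^p < \|x^\perp\|_p^p \le c$. Furthermore,

$$\|x^\star - x'\|_2^2 = \sum_{j=1}^n |x^\star_j - x'_j|^2 = \sum_{j \ne i} |x^\star_j - x^\perp_j|^2 < \sum_{j=1}^n |x^\star_j - x^\perp_j|^2 = \|x^\star - x^\perp\|_2^2.$$

This is a contradiction since by definition

$$x^\perp \in \arg\min_x \frac{1}{2}\|x^\star - x\|_2^2 \quad \text{s.t.} \quad \|x\|_p^p \le c.$$

■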



Figure 2: Partitioning of vector $d^k = x^k - x^\perp$. The color gradient represents decrease of the magnitudes of the corresponding
coordinates.

To continue, we introduce the following sets which partition the coordinates of vector $d^k$ for $k = 0, 1, 2, \ldots$.
As defined previously in Lemma 2.3, let $S = \mathrm{supp}(x^\star)$. Lemma 2.3 shows that $\mathrm{supp}(x^\perp) \subseteq S$, thus we can
assume that $x^\perp$ is $s$-sparse. Let $S_{k,1}$ be the support of the $s$ largest entries of $d^k|_{S^c}$ in magnitude, and define
$T_k = S \cup S_{k,1}$. Furthermore, let $S_{k,2}$ be the support of the $s$ largest entries of $d^k|_{T_k^c}$, $S_{k,3}$ be the support of
the next $s$ largest entries of $d^k|_{T_k^c}$, and so on. We also set $T_{k,j} = S_{k,j} \cup S_{k,j+1}$ for $j \ge 1$. This partitioning
of the vector $d^k$ is illustrated in Fig. 2.
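Lemma 2.4. For $k = 0, 1, 2, \ldots$ the vector $d^k$ obeys

$$\sum_{i \ge 2} \left\|d^k|_{S_{k,i}}\right\|_2 \le \sqrt{2p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p} \left\|d^k|_{S^c}\right\|_p.$$

Proof. Since $S_{k,j}$ and $S_{k,j+1}$ are disjoint and $T_{k,j} = S_{k,j} \cup S_{k,j+1}$ for $j \ge 1$, we have

$$\left\|d^k|_{S_{k,j}}\right\|_2 + \left\|d^k|_{S_{k,j+1}}\right\|_2 \le \sqrt{2}\left\|d^k|_{T_{k,j}}\right\|_2.$$

Adding over even $j$'s then we deduce

$$\sum_{j \ge 2} \left\|d^k|_{S_{k,j}}\right\|_2 \le \sqrt{2}\sum_{i \ge 1} \left\|d^k|_{T_{k,2i}}\right\|_2.$$

Because of the structure of the sets $T_{k,j}$, Lemma 2.2 can be applied to obtain

$$\left\|d^k|_{T_{k,j}}\right\|_2 \le \sqrt{p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p} \left\|d^k|_{T_{k,j-1}}\right\|_p. \tag{14}$$

To be precise, based on Lemma 2.2 the coefficient on the RHS should be $C = \max\left\{(2s)^{1/2 - 1/p},\; \sqrt{\tfrac{p}{2}}\left(\tfrac{2s}{2-p}\right)^{1/2 - 1/p}\right\}$.
For simplicity, however, we use the upper bound $C \le \sqrt{p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}$. To verify this upper bound it
suffices to show that $(2s)^{1/2 - 1/p} \le \sqrt{p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}$ or equivalently $\phi(p) = p\log p + (2 - p)\log(2 - p) \ge 0$ for
$p \in (0, 1]$. Since $\phi(\cdot)$ is a decreasing function over $(0, 1]$, it attains its minimum at $p = 1$ which means that
$\phi(p) \ge \phi(1) = 0$ as desired.
Then (14) yields

$$\sum_{j \ge 2} \left\|d^k|_{S_{k,j}}\right\|_2 \le \sqrt{2p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p} \sum_{i \ge 1} \left\|d^k|_{T_{k,2i-1}}\right\|_p.$$

Since $\omega_1 + \omega_2 + \cdots + \omega_l \le \left(\omega_1^p + \omega_2^p + \cdots + \omega_l^p\right)^{1/p}$ holds for $\omega_1, \ldots, \omega_l \ge 0$ and $p \in (0, 1]$, we can write

$$\sum_{i \ge 1} \left\|d^k|_{T_{k,2i-1}}\right\|_p \le \left(\sum_{i \ge 1} \left\|d^k|_{T_{k,2i-1}}\right\|_p^p\right)^{1/p}.$$

The desired result then follows using the fact that the sets $T_{k,2i-1}$ are disjoint and $\bigcup_{i \ge 1} T_{k,2i-1} = S^c$. ■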



Proof of the following Lemma mostly relies on some common inequalities that have been used in the 
compressed sensing literature (see e.g., [8, Theorem 2.1] and [15, Theorem 2]) . 



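Lemma 2.5. The error vector $d^k$ satisfies $\left\|d^k|_{S^c}\right\|_p \le s^{1/p - 1/2}\left\|d^k|_S\right\|_2$ for all $k = 0, 1, 2, \ldots$.

Proof. Since $\mathrm{supp}(x^\perp) \subseteq S = \mathrm{supp}(x^\star)$ we have $d^k|_{S^c} = x^k|_{S^c}$. Furthermore, because $x^k$ is a feasible point
by assumption we have $\|x^k\|_p^p \le c = \|x^\perp\|_p^p$ that implies

$$\left\|d^k|_{S^c}\right\|_p^p = \left\|x^k|_{S^c}\right\|_p^p \le \|x^\perp\|_p^p - \left\|x^k|_S\right\|_p^p \le \left\|x^\perp - x^k|_S\right\|_p^p = \left\|d^k|_S\right\|_p^p \le s^{1 - p/2}\left\|d^k|_S\right\|_2^p \quad \text{(power means inequality)},$$

which yields the desired result. ■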



The next lemma is a straightforward extension of a previously known result [11, Lemma 3.1] to the case 
of complex vectors and asymmetric RIP. 

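Lemma 2.6. For $u, v \in \mathbb{C}^n$ suppose that matrix $A$ satisfies RIP of order $\max\left(\|u + v\|_0, \|u - v\|_0\right)$ with
constants $\alpha$ and $\beta$. Then we have

$$\left|\Re\left[\eta\langle Au, Av\rangle - \langle u, v\rangle\right]\right| \le \left(\frac{\eta(\alpha - \beta)}{2} + \left|\frac{\eta(\alpha + \beta)}{2} - 1\right|\right)\|u\|_2\|v\|_2.$$

Proof. If either of the vectors $u$ and $v$ is zero the claim becomes trivial. So without loss of generality we
assume that none of these vectors is zero. The RIP condition holds for the vectors $u \pm v$ and we have

$$\beta\|u \pm v\|_2^2 \le \|A(u \pm v)\|_2^2 \le \alpha\|u \pm v\|_2^2.$$

Therefore, we obtain

$$\Re\langle Au, Av\rangle = \frac{1}{4}\left(\|A(u + v)\|_2^2 - \|A(u - v)\|_2^2\right) \le \frac{1}{4}\left(\alpha\|u + v\|_2^2 - \beta\|u - v\|_2^2\right) = \frac{\alpha - \beta}{4}\left(\|u\|_2^2 + \|v\|_2^2\right) + \frac{\alpha + \beta}{2}\Re\langle u, v\rangle.$$

Applying this inequality for vectors $\frac{u}{\|u\|_2}$ and $\frac{v}{\|v\|_2}$ yields

$$\Re\left[\eta\left\langle A\frac{u}{\|u\|_2}, A\frac{v}{\|v\|_2}\right\rangle - \left\langle \frac{u}{\|u\|_2}, \frac{v}{\|v\|_2}\right\rangle\right] \le \frac{\eta(\alpha - \beta)}{2} + \left(\frac{\eta(\alpha + \beta)}{2} - 1\right)\Re\left\langle \frac{u}{\|u\|_2}, \frac{v}{\|v\|_2}\right\rangle \le \frac{\eta(\alpha - \beta)}{2} + \left|\frac{\eta(\alpha + \beta)}{2} - 1\right|.$$

Similarly it can be shown that

$$\Re\left[\eta\left\langle A\frac{u}{\|u\|_2}, A\frac{v}{\|v\|_2}\right\rangle - \left\langle \frac{u}{\|u\|_2}, \frac{v}{\|v\|_2}\right\rangle\right] \ge -\frac{\eta(\alpha - \beta)}{2} - \left|\frac{\eta(\alpha + \beta)}{2} - 1\right|.$$

The desired result follows immediately by multiplying the last two inequalities by $\|u\|_2\|v\|_2$. ■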
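
The inequality of Lemma 2.6 can be checked numerically for a single pair of vectors. In the sketch below, which is only illustrative, $u$ and $v$ share a support $J$, and $\alpha$, $\beta$ are taken as the extreme eigenvalues of $A_J^H A_J$; these bound $\|Aw\|_2^2/\|w\|_2^2$ for every $w$ supported on $J$, which is all the proof uses for $u \pm v$. The helper name `lemma_sides`, the matrix sizes, and the step-size values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 60, 4
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2 * m)

# u and v are supported on a common index set J, hence so are u + v and u - v.
J = rng.choice(n, k, replace=False)
u = np.zeros(n, dtype=complex)
v = np.zeros(n, dtype=complex)
u[J] = rng.standard_normal(k) + 1j * rng.standard_normal(k)
v[J] = rng.standard_normal(k) + 1j * rng.standard_normal(k)

# Local constants: extreme eigenvalues of A_J^H A_J (ascending order from eigvalsh).
eig = np.linalg.eigvalsh(A[:, J].conj().T @ A[:, J])
beta, alpha = eig[0], eig[-1]

def lemma_sides(eta):
    """Return (lhs, rhs) of the Lemma 2.6 inequality for step-size eta."""
    lhs = abs(np.real(eta * np.vdot(A @ u, A @ v) - np.vdot(u, v)))
    rhs = (eta * (alpha - beta) / 2 + abs(eta * (alpha + beta) / 2 - 1)) \
        * np.linalg.norm(u) * np.linalg.norm(v)
    return lhs, rhs

lhs, rhs = lemma_sides(2.0 / (alpha + beta))   # the step-size that makes the second term vanish
```

With $\eta = 2/(\alpha + \beta)$ the right-hand side reduces to $\frac{\alpha - \beta}{\alpha + \beta}\|u\|_2\|v\|_2$, the same ratio that defines $\rho_s$.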




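Lemma 2.7. If the step-size of $\ell_p$-PGD obeys $\left|\eta_k(\alpha_{3s} + \beta_{3s})/2 - 1\right| \le \tau$ for some $\tau \ge 0$, then we have

$$\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k\langle A d^k, A d^{k+1}\rangle\right] \le \left((1 + \tau)\rho_{3s} + \tau\right)\left(1 + \sqrt{2p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\right)^2 \|d^k\|_2\|d^{k+1}\|_2.$$

Proof. Note that

$$\begin{aligned}
\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k\langle A d^k, A d^{k+1}\rangle\right]
&= \Re\left[\left\langle d^k|_{T_k}, d^{k+1}|_{T_{k+1}}\right\rangle - \eta_k\left\langle A d^k|_{T_k}, A d^{k+1}|_{T_{k+1}}\right\rangle\right] \\
&\quad + \sum_{i \ge 2}\Re\left[\left\langle d^k|_{S_{k,i}}, d^{k+1}|_{T_{k+1}}\right\rangle - \eta_k\left\langle A d^k|_{S_{k,i}}, A d^{k+1}|_{T_{k+1}}\right\rangle\right] \\
&\quad + \sum_{j \ge 2}\Re\left[\left\langle d^k|_{T_k}, d^{k+1}|_{S_{k+1,j}}\right\rangle - \eta_k\left\langle A d^k|_{T_k}, A d^{k+1}|_{S_{k+1,j}}\right\rangle\right] \\
&\quad + \sum_{i,j \ge 2}\Re\left[\left\langle d^k|_{S_{k,i}}, d^{k+1}|_{S_{k+1,j}}\right\rangle - \eta_k\left\langle A d^k|_{S_{k,i}}, A d^{k+1}|_{S_{k+1,j}}\right\rangle\right].
\end{aligned} \tag{15}$$

Note that $|T_k \cup T_{k+1}| \le 3s$. Furthermore, for $i, j \ge 2$ we have $|T_k \cup S_{k+1,j}| \le 3s$, $|T_{k+1} \cup S_{k,i}| \le 3s$, and
$|S_{k,i} \cup S_{k+1,j}| \le 2s$. Therefore, by applying Lemma 2.6 for each of the summands in (15) and using the fact
that

$$\rho'_{3s} := (1 + \tau)\rho_{3s} + \tau \ge \eta_k(\alpha_{3s} - \beta_{3s})/2 + \left|\eta_k(\alpha_{3s} + \beta_{3s})/2 - 1\right|$$

we obtain

$$\begin{aligned}
\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k\langle A d^k, A d^{k+1}\rangle\right]
&\le \rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{T_{k+1}}\right\|_2 + \sum_{i \ge 2}\rho'_{3s}\left\|d^k|_{S_{k,i}}\right\|_2\left\|d^{k+1}|_{T_{k+1}}\right\|_2 \\
&\quad + \sum_{j \ge 2}\rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{S_{k+1,j}}\right\|_2 + \sum_{i,j \ge 2}\rho'_{3s}\left\|d^k|_{S_{k,i}}\right\|_2\left\|d^{k+1}|_{S_{k+1,j}}\right\|_2.
\end{aligned}$$

Hence, applying Lemma 2.4 yields

$$\begin{aligned}
\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k\langle A d^k, A d^{k+1}\rangle\right]
&\le \rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{T_{k+1}}\right\|_2
+ \sqrt{2p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}\rho'_{3s}\left\|d^k|_{S^c}\right\|_p\left\|d^{k+1}|_{T_{k+1}}\right\|_2 \\
&\quad + \sqrt{2p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}\rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{S^c}\right\|_p
+ 2p\left(\frac{2s}{2-p}\right)^{1 - 2/p}\rho'_{3s}\left\|d^k|_{S^c}\right\|_p\left\|d^{k+1}|_{S^c}\right\|_p.
\end{aligned}$$

Then it follows from Lemma 2.5,

$$\begin{aligned}
\Re\left[\langle d^k, d^{k+1}\rangle - \eta_k\langle A d^k, A d^{k+1}\rangle\right]
&\le \rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{T_{k+1}}\right\|_2
+ \sqrt{2p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\rho'_{3s}\left\|d^k|_{S}\right\|_2\left\|d^{k+1}|_{T_{k+1}}\right\|_2 \\
&\quad + \sqrt{2p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\rho'_{3s}\left\|d^k|_{T_k}\right\|_2\left\|d^{k+1}|_{S}\right\|_2
+ 2p\left(\frac{2}{2-p}\right)^{1 - 2/p}\rho'_{3s}\left\|d^k|_{S}\right\|_2\left\|d^{k+1}|_{S}\right\|_2 \\
&\le \rho'_{3s}\left(1 + \sqrt{2p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\right)^2\|d^k\|_2\|d^{k+1}\|_2.
\end{aligned}$$

■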

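Now we are ready to prove the accuracy guarantees for the $\ell_p$-PGD algorithm.
Proof of Theorem 2.1. Recall that $\gamma$ is defined by (11). It follows from Lemmas 2.1 and 2.7 that

$$\|d^k\|_2^2 \le 2\gamma\|d^k\|_2\|d^{k-1}\|_2 + 2\eta_k\Re\langle A d^k, A d^\star + e\rangle \le 2\gamma\|d^k\|_2\|d^{k-1}\|_2 + 2\eta_k\|A d^k\|_2\|A d^\star + e\|_2.$$

Furthermore, using (14) and Lemma 2.5 we deduce

$$\begin{aligned}
\|A d^k\|_2 &\le \left\|A d^k|_{T_k}\right\|_2 + \sum_{i \ge 1}\left\|A d^k|_{T_{k,2i}}\right\|_2 \\
&\le \sqrt{\alpha_{2s}}\left\|d^k|_{T_k}\right\|_2 + \sum_{i \ge 1}\sqrt{\alpha_{2s}}\left\|d^k|_{T_{k,2i}}\right\|_2 \\
&\le \sqrt{\alpha_{2s}}\left\|d^k|_{T_k}\right\|_2 + \sqrt{\alpha_{2s}}\sqrt{p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}\sum_{i \ge 1}\left\|d^k|_{T_{k,2i-1}}\right\|_p \\
&\le \sqrt{\alpha_{2s}}\left\|d^k|_{T_k}\right\|_2 + \sqrt{\alpha_{2s}}\sqrt{p}\left(\frac{2s}{2-p}\right)^{1/2 - 1/p}\left\|d^k|_{S^c}\right\|_p \\
&\le \sqrt{\alpha_{2s}}\left\|d^k|_{T_k}\right\|_2 + \sqrt{\alpha_{2s}}\sqrt{p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\left\|d^k|_{S}\right\|_2 \\
&\le \sqrt{\alpha_{2s}}\left(1 + \sqrt{p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}\right)\|d^k\|_2.
\end{aligned}$$

Therefore,

$$\|d^k\|_2^2 \le 2\gamma\|d^k\|_2\|d^{k-1}\|_2 + 2\eta_k\sqrt{\alpha_{2s}}\left(1 + \xi(p)\right)\|d^k\|_2\|A d^\star + e\|_2,$$

which after canceling $\|d^k\|_2$ yields

$$\begin{aligned}
\|d^k\|_2 &\le 2\gamma\|d^{k-1}\|_2 + 2\eta_k\sqrt{\alpha_{2s}}\left(1 + \xi(p)\right)\|A d^\star + e\|_2 \\
&= 2\gamma\|d^{k-1}\|_2 + 2\eta_k(\alpha_{3s} + \beta_{3s})\frac{\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\left(1 + \xi(p)\right)\|A d^\star + e\|_2 \\
&\le 2\gamma\|d^{k-1}\|_2 + 4(1 + \tau)\frac{\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\left(1 + \xi(p)\right)\left(\|A d^\star\|_2 + \|e\|_2\right).
\end{aligned}$$

Since $x^\perp$ is a projection of $x^\star$ onto the feasible set $\mathcal{B}$ and $\left(\frac{c}{\|x^\star\|_p^p}\right)^{1/p} x^\star = (1 - \varepsilon)x^\star \in \mathcal{B}$ we have

$$\|d^\star\|_2 = \|x^\perp - x^\star\|_2 \le \left\|\left(\frac{c}{\|x^\star\|_p^p}\right)^{1/p} x^\star - x^\star\right\|_2 = \varepsilon\|x^\star\|_2.$$

Furthermore, $\mathrm{supp}(d^\star) \subseteq S$, thereby we can use RIP to obtain

$$\|A d^\star\|_2 \le \sqrt{\alpha_s}\|d^\star\|_2 \le \varepsilon\sqrt{\alpha_s}\|x^\star\|_2.$$

Hence,

$$\begin{aligned}
\|d^k\|_2 &\le 2\gamma\|d^{k-1}\|_2 + 4(1 + \tau)\frac{\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\left(1 + \xi(p)\right)\left(\varepsilon\sqrt{\alpha_s}\|x^\star\|_2 + \|e\|_2\right) \\
&\le 2\gamma\|d^{k-1}\|_2 + 2(1 + \tau)\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star\|_2 + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|e\|_2\right).
\end{aligned}$$

Applying this inequality recursively and using the fact that

$$\sum_{i=0}^{k-1}(2\gamma)^i < \sum_{i=0}^{\infty}(2\gamma)^i = \frac{1}{1 - 2\gamma},$$

which holds because of the assumption $\gamma < \frac{1}{2}$, we can finally deduce

$$\begin{aligned}
\|x^k - x^\star\|_2 &= \|d^k - d^\star\|_2 \\
&\le \|d^k\|_2 + \|d^\star\|_2 \\
&\le (2\gamma)^k\|x^\perp\|_2 + \frac{2(1+\tau)}{1 - 2\gamma}\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star\|_2 + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|e\|_2\right) + \|d^\star\|_2 \\
&\le (2\gamma)^k\|x^\star\|_2 + \frac{2(1+\tau)}{1 - 2\gamma}\left(1 + \xi(p)\right)\left(\varepsilon(1 + \rho_{3s})\|x^\star\|_2 + \frac{2\sqrt{\alpha_{2s}}}{\alpha_{3s} + \beta_{3s}}\|e\|_2\right) + \varepsilon\|x^\star\|_2,
\end{aligned}$$

where $\xi(p) = \sqrt{p}\left(\frac{2}{2-p}\right)^{1/2 - 1/p}$ as defined in the statement of the theorem. ■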
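
The last step of the proof unrolls a scalar recursion of the form $d_k \le 2\gamma\, d_{k-1} + b$ with $\gamma < 1/2$ into a geometric term plus a constant. The short sketch below illustrates this with made-up numbers; the values of `gamma`, `b`, and `d0` and both function names are hypothetical and not taken from the paper.

```python
def unrolled_bound(gamma, b, d0, k):
    """Closed-form bound (2*gamma)**k * d0 + b / (1 - 2*gamma) for d_k <= 2*gamma*d_{k-1} + b."""
    assert 0 <= gamma < 0.5
    return (2 * gamma) ** k * d0 + b / (1 - 2 * gamma)

def worst_case_sequence(gamma, b, d0, k):
    """Iterate the recursion with equality, i.e. the largest sequence the recursion allows."""
    d = d0
    for _ in range(k):
        d = 2 * gamma * d + b
    return d

gamma, b, d0 = 0.3, 0.01, 1.0   # hypothetical contraction factor, residual term, initial error
errors = [worst_case_sequence(gamma, b, d0, k) for k in range(30)]
bounds = [unrolled_bound(gamma, b, d0, k) for k in range(30)]
```

The first term decays linearly (geometrically) in $k$, while the second is the floor $b/(1 - 2\gamma)$ that remains no matter how many iterations are run.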

3. Discussion 

In this paper we studied the accuracy of the Projected Gradient Descent algorithm in solving sparse
least squares problems where sparsity is dictated by an $\ell_p$-norm constraint. Assuming that one has an
algorithm that can find a projection of any given point onto $\ell_p$-balls with $p \in [0, 1]$, we have shown that the
PGD method converges to the true signal, up to the statistical precision, at a linear rate. The convergence
guarantees in this paper are obtained by requiring proper RIP conditions to hold for the measurement
matrix. By varying $p$ from zero to one, these sufficient conditions become more stringent while robustness
to noise and convergence rate worsen. This behavior suggests that smaller values of $p$ are preferable, and in
fact the PGD method at $p = 0$ (i.e., the IHT algorithm) outperforms the PGD method at $p > 0$ in every
aspect. These conclusions, however, are not definitive as we have merely presented sufficient conditions for
accuracy of the PGD method.

Unfortunately and surprisingly, for $p \in (0, 1)$ the algorithm for projection onto $\ell_p$-balls is not as simple
as the cases of $p = 0$ and $p = 1$, leaving practicality of the algorithm unclear for the intermediate values of $p$.
We have shown (see the Appendix) that a projection $x^\perp$ of point $x \in \mathbb{C}^n$ has the following properties

(i) $|x_i^\perp| \le |x_i|$ for all $i \in [n]$ while there is at most one $i \in [n]$ such that $|x_i^\perp| < \frac{1-p}{2-p}|x_i|$,

(ii) $\mathrm{Arg}(x_i) = \mathrm{Arg}(x_i^\perp)$ for $i \in [n]$,

(iii) if $|x_i| > |x_j|$ for some $i, j \in [n]$ then $|x_i^\perp| \ge |x_j^\perp|$, and

(iv) there exists $\lambda \ge 0$ such that for all $i \in \mathrm{supp}(x^\perp)$ we have $|x_i^\perp|^{1-p}\left(|x_i| - |x_i^\perp|\right) = p\lambda$.

However, these properties are not sufficient for full characterization of a projection. One may ask: if the
PGD method performs the best at $p = 0$, then why is it important at all to design a projection algorithm
for $p > 0$? We believe that developing an efficient algorithm for projection onto $\ell_p$-balls with $p \in (0, 1)$ is an
interesting problem that can provide a building block for other methods of sparse signal estimation involving
the $\ell_p$-norm. Furthermore, studying this problem may help to gain insight into how the complexity of these
algorithms varies in terms of $p$.

In future work, we would like to examine the performance of more sophisticated first-order methods such
as Nesterov's optimal gradient methods [19] for $\ell_p$-constrained least squares problems. Furthermore, it
could be possible to extend the provided framework further to analyze $\ell_p$-constrained minimization with
objective functions other than the squared error. This generalized framework can be used in problems such
as regression with generalized linear models that arise in statistics and machine learning.
Appendix A. Lemmas for Characterization of a Projection onto $\ell_p$-balls

In what follows we assume that $\mathcal{B}$ is an $\ell_p$-ball with $p$-radius $c$ (i.e., $\mathcal{B} = \mathcal{F}_p(c)$). For $x \in \mathbb{C}^n$ we derive
some properties of

$$x^\perp \in \arg\min_u \frac{1}{2}\|x - u\|_2^2 \quad \text{s.t.} \quad u \in \mathcal{B}, \tag{A.1}$$

a projection of $x$ onto $\mathcal{B}$.

Lemma A.1. Let $x^\perp$ be a projection of $x$ onto $\mathcal{B}$. Then for every $i \in \{1, 2, \ldots, n\}$ we have $\mathrm{Arg}(x_i) =
\mathrm{Arg}(x_i^\perp)$ and $|x_i^\perp| \le |x_i|$.

Proof. Proof by contradiction. Suppose that for some $i$ we have $\mathrm{Arg}(x_i) \ne \mathrm{Arg}(x_i^\perp)$ or $|x_i^\perp| > |x_i|$. Consider
the vector $x'$ for which $x'_j = x_j^\perp$ for $j \ne i$ and $x'_i = \min\left\{|x_i|, |x_i^\perp|\right\}\exp\left(\iota\,\mathrm{Arg}(x_i)\right)$ (the character $\iota$ denotes
the imaginary unit $\sqrt{-1}$). We have $\|x'\|_p \le \|x^\perp\|_p$ which implies that $x' \in \mathcal{B}$. Since $|x_i - x'_i| < |x_i - x_i^\perp|$
we have $\|x' - x\|_2 < \|x^\perp - x\|_2$ which contradicts the choice of $x^\perp$ as a projection. ■

Assumption. Lemma A.l asserts that the projection x^ has the same phase components as x. Therefore, 
without loss of generality and for simplicity in the following lemmas we assume x has real-valued non-negative 
entries. 

Lemma A. 2. For any x in the positive orthant there is a projection x^ of x onto the set 23 such that for 
i, j G {1, 2, . . . , n} we have xf- < xj- iff Xi < Xj . 

Proof. Note that the set $\mathcal{B}$ is closed under any permutation of coordinates. In particular, by interchanging 
the $i$-th and $j$-th entries of $\mathbf{x}^\perp$ we obtain another vector $\mathbf{x}'$ in $\mathcal{B}$. Since $\mathbf{x}^\perp$ is a projection of $\mathbf{x}$ onto $\mathcal{B}$ we 
must have $\|\mathbf{x} - \mathbf{x}^\perp\|_2^2 \le \|\mathbf{x} - \mathbf{x}'\|_2^2$. Therefore, we have $(x_i - x_i^\perp)^2 + (x_j - x_j^\perp)^2 \le (x_i - x_j^\perp)^2 + (x_j - x_i^\perp)^2$ 
and from that $0 \le (x_i - x_j)(x_i^\perp - x_j^\perp)$. For $x_i \neq x_j$ the result follows immediately, and for $x_i = x_j$ without 
loss of generality we can assume $x_i^\perp \le x_j^\perp$. ■ 
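
Lemmas A.1 and A.2 can be checked numerically in the two endpoint cases $p = 0$ and $p = 1$, where a projection has a well-known closed form (keeping the largest-magnitude entries, and soft thresholding, respectively). The sketch below is only an illustration under those assumptions and is not taken from the paper; the function names `project_l0`, `project_l1`, and `satisfies_lemmas` are hypothetical. The order check uses the pairwise inequality $0 \le (|x_i| - |x_j|)(|x_i^\perp| - |x_j^\perp|)$ derived in the proof of Lemma A.2.

```python
import numpy as np

def project_l0(x, c):
    """A projection onto {u : ||u||_0 <= c}: keep the c largest-magnitude entries."""
    out = np.zeros_like(x)
    keep = np.argsort(-np.abs(x), kind="stable")[:c]
    out[keep] = x[keep]
    return out

def project_l1(x, c):
    """Projection onto {u : ||u||_1 <= c} (c > 0) by sorting-based soft thresholding."""
    mag = np.abs(x)
    if mag.sum() <= c:
        return x.copy()
    s = np.sort(mag)[::-1]                      # magnitudes in decreasing order
    css = np.cumsum(s)
    k = np.arange(1, len(s) + 1)
    rho = np.nonzero(s - (css - c) / k > 0)[0][-1]
    lam = (css[rho] - c) / (rho + 1)            # shrinkage level
    shrunk = np.maximum(mag - lam, 0.0)
    return shrunk * np.exp(1j * np.angle(x))    # phases are left untouched

def satisfies_lemmas(x, proj, tol=1e-12):
    """Check the phase/magnitude claim of Lemma A.1 and the ordering of Lemma A.2."""
    mx, mp = np.abs(x), np.abs(proj)
    support = mp > tol
    # Lemma A.1: same phase on the support, and magnitudes do not grow.
    same_phase = np.allclose(proj[support] / mp[support], x[support] / mx[support])
    shrinks = np.all(mp <= mx + tol)
    # Lemma A.2 (pairwise form from its proof): magnitudes keep their order.
    monotone = np.all(np.subtract.outer(mx, mx) * np.subtract.outer(mp, mp) >= -tol)
    return bool(same_phase and shrinks and monotone)
```

For example, drawing a random complex vector with `numpy.random.default_rng` and calling `satisfies_lemmas(x, project_l1(x, 1.0))` exercises both properties at once.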

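Lemma A.3. Let $\mathcal{S}^\perp$ be the support set of $\mathbf{x}^\perp$. Then there exists a $\lambda \ge 0$ such that 

\[ \left(x_i^\perp\right)^{1-p} \left(x_i - x_i^\perp\right) = p\lambda \]

for all $i \in \mathcal{S}^\perp$. 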

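Proof. The fact that $\mathbf{x}^\perp$ is a solution to the minimization expressed in (A.1) implies that $\mathbf{x}^\perp|_{\mathcal{S}^\perp}$ must 
be a solution to 

\[ \arg\min_{\mathbf{v}} \frac{1}{2}\left\|\mathbf{x}|_{\mathcal{S}^\perp} - \mathbf{v}\right\|_2^2 \quad \text{s.t.} \quad \|\mathbf{v}\|_p^p \le c. \]

The normal to the feasible set (i.e., the gradient of the constraint function) is uniquely defined at $\mathbf{x}^\perp|_{\mathcal{S}^\perp}$ 
since all of its entries are positive by assumption. Consequently, the Lagrangian 

\[ L(\mathbf{v}, \lambda) = \frac{1}{2}\left\|\mathbf{x}|_{\mathcal{S}^\perp} - \mathbf{v}\right\|_2^2 + \lambda\left(\|\mathbf{v}\|_p^p - c\right) \]

has a well-defined partial derivative $\partial L / \partial \mathbf{v}$ at $\mathbf{x}^\perp|_{\mathcal{S}^\perp}$, which must be equal to zero for an appropriate $\lambda \ge 0$. 
Hence, 

\[ \forall i \in \mathcal{S}^\perp: \quad x_i^\perp - x_i + p\lambda \left(x_i^\perp\right)^{p-1} = 0, \]

which is equivalent to the desired result. ■ 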
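
As a quick sanity check, the identity of Lemma A.3 can be specialized to a few values of $p$. The cases below are an illustrative sketch rather than part of the original argument. The case $p = 0$ is read off directly from the identity, under the assumption that it is interpreted as in Lemma A.4, since the constraint function is not differentiable there.

```latex
% p = 1: the identity reduces to a constant shift (soft thresholding) on the support.
\[ x_i - x_i^\perp = \lambda \;\Longrightarrow\; x_i^\perp = x_i - \lambda, \qquad i \in \mathcal{S}^\perp . \]
% p = 0: the right-hand side vanishes and x_i^\perp > 0 on the support,
% so the retained entries are left unchanged (hard thresholding).
\[ x_i^\perp \left(x_i - x_i^\perp\right) = 0 \;\Longrightarrow\; x_i^\perp = x_i, \qquad i \in \mathcal{S}^\perp . \]
% p = 1/2: an intermediate case between the two.
\[ \sqrt{x_i^\perp}\,\left(x_i - x_i^\perp\right) = \tfrac{\lambda}{2}, \qquad i \in \mathcal{S}^\perp . \]
```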
Lemma A.4. Let $\lambda \ge 0$ and $p \in [0, 1]$ be fixed numbers and set $T_0 = (2 - p)\left(p(1 - p)^{p-1}\lambda\right)^{\frac{1}{2-p}}$. Denote 
the function $t^{1-p}(T - t)$ by $h_p(t)$. The following statements hold regarding the roots of $h_p(t) = p\lambda$: 

(i) For $p = 1$ and $T \ge T_0$ the equation $h_1(t) = \lambda$ has a unique solution at $t = T - \lambda \in [0, T]$, which is an 
increasing function of $T$. 

(ii) For $p \in [0, 1)$ and $T \ge T_0$ the equation $h_p(t) = p\lambda$ has two roots $t_-$ and $t_+$ satisfying $t_- \in \left[0, \frac{1-p}{2-p}T\right]$ 
and $t_+ \in \left[\frac{1-p}{2-p}T, +\infty\right)$. As functions of $T$, $t_-$ and $t_+$ are decreasing and increasing, respectively, and 
they coincide at $T = T_0$. 

Proof. Fig. A.3 illustrates $h_p(t)$ for different values of $p \in [0, 1]$. To verify part (i) observe that we have 
$T_0 = \lambda$ and thereby $T \ge \lambda$. The claim is then obvious since $h_1(t) - \lambda = T - t - \lambda$ is zero at $t = T - \lambda$. Part 
(ii) is more intricate and we divide it into two cases: $p = 0$ and $p \neq 0$. At $p = 0$ we have $T_0 = 0$ and 
$h_0(t) = t(T - t)$ has two zeros at $t_- = 0$ and $t_+ = T$ that obviously satisfy the claim. So we can now 
focus on the case $p \in (0, 1)$. It is straightforward to verify that $t_{\max} = \frac{1-p}{2-p}T$ is the location at which $h_p(t)$ 
peaks. Straightforward algebraic manipulations also show that $T > T_0$ is equivalent to $p\lambda < h_p(t_{\max})$. 
Furthermore, inspecting the sign of $h'_p(t)$ shows that $h_p(t)$ is strictly increasing over $[0, t_{\max}]$ while it is 
strictly decreasing over $[t_{\max}, T]$. Then, using the fact that $h_p(0) = h_p(T) = 0 \le p\lambda < h_p(t_{\max})$, it follows 
from the intermediate value theorem that $h_p(t) = p\lambda$ has exactly two roots, $t_-$ and $t_+$, that straddle $t_{\max}$ 
as claimed. Furthermore, taking the derivative of $t_-^{1-p}(T - t_-) = p\lambda$ with respect to $T$ yields 

\[ (1 - p)\, t'_- t_-^{-p} (T - t_-) + t_-^{1-p} (1 - t'_-) = 0. \]

Hence, 

\[ \left((1 - p)(T - t_-) - t_-\right) t'_- = -t_-, \]

which because $t_- < t_{\max} = \frac{1-p}{2-p}T$ implies that $t'_- < 0$. Thus $t_-$ is a decreasing function of $T$. Similarly we 
can show that $t_+$ is an increasing function of $T$ using the fact that $t_+ > t_{\max}$. Finally, as $T$ decreases to 
$T_0$ the peak value $h_p(t_{\max})$ decreases to $p\lambda$, which implies that $t_-$ and $t_+$ both tend to the same value of 
$\frac{1-p}{2-p}T_0$. ■ 
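
The statements of Lemma A.4 for $p \in (0, 1)$ can be explored numerically. The following is a minimal sketch, assuming plain bisection on either side of the peak location $(1-p)T/(2-p)$; the helper names `t0`, `h`, `bisect`, and `roots` are hypothetical and chosen only for this illustration.

```python
def t0(p, lam):
    """Threshold T_0 = (2-p) * (p * (1-p)**(p-1) * lam)**(1/(2-p)), for 0 < p < 1."""
    return (2 - p) * (p * (1 - p) ** (p - 1) * lam) ** (1 / (2 - p))

def h(t, p, T):
    """The function h_p(t) = t**(1-p) * (T - t)."""
    return t ** (1 - p) * (T - t)

def bisect(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fm = f(mid)
        if (fm > 0) == (flo > 0):
            lo, flo = mid, fm      # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return 0.5 * (lo + hi)

def roots(p, lam, T):
    """Return (t_minus, t_plus) with h_p(t) = p*lam, assuming 0 < p < 1 and T > T_0."""
    t_max = (1 - p) * T / (2 - p)  # location of the peak of h_p
    f = lambda t: h(t, p, T) - p * lam
    return bisect(f, 0.0, t_max), bisect(f, t_max, T)
```

Calling `roots(0.5, 1.0, T)` for increasing values of `T` above `t0(0.5, 1.0)` gives one way to observe the smaller root moving down and the larger root moving up, as the lemma states.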



Lemma A.5. Suppose that $x_i = x_j > 0$ for some $i \neq j$. If $x_i^\perp = x_j^\perp > 0$ then $x_i^\perp \ge \frac{1-p}{2-p} x_i$. 

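Proof. For $p \in \{0, 1\}$ the claim is obvious since at $p = 0$ we have $x_i^\perp = x_i > \frac{1}{2}x_i$ and at $p = 1$ we have 
$\frac{1-p}{2-p}x_i = 0$. Therefore, without loss of generality we assume $p \in (0, 1)$. The proof is by contradiction. 

Suppose that $w = \frac{x_i^\perp}{x_i} = \frac{x_j^\perp}{x_j} < \frac{1-p}{2-p}$. Since $\mathbf{x}^\perp$ is a projection it follows that $a = b = w$ must be the solution 
to 

\[ \arg\min_{a, b}\; \psi = \frac{1}{2}\left[(1 - a)^2 + (1 - b)^2\right] \quad \text{s.t.} \quad a^p + b^p = 2w^p,\ a > 0, \text{ and } b > 0, \]

otherwise the vector $\mathbf{x}'$ that is identical to $\mathbf{x}^\perp$ except for $x'_i = a x_i \neq x_i^\perp$ and $x'_j = b x_j \neq x_j^\perp$ is also a feasible 
point (i.e., $\mathbf{x}' \in \mathcal{B}$) that satisfies 

\[ \begin{aligned} \|\mathbf{x}' - \mathbf{x}\|_2^2 - \|\mathbf{x}^\perp - \mathbf{x}\|_2^2 &= (1 - a)^2 x_i^2 + (1 - b)^2 x_j^2 - (1 - w)^2 x_i^2 - (1 - w)^2 x_j^2 \\ &= \left((1 - a)^2 + (1 - b)^2 - 2(1 - w)^2\right) x_i^2 \\ &< 0, \end{aligned} \]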
Figure A.3: The function $t^{1-p}(T - t)$ for different values of $p$ (horizontal axis: $t$, with $T$ marked). 

which is absurd. If $b$ is considered as a function of $a$ then $\psi$ can be seen merely as a function of $a$, i.e., 
$\psi \equiv \psi(a)$. Taking the derivative of $\psi$ with respect to $a$ yields 

\[ \begin{aligned} \psi'(a) &= a - 1 + b'(b - 1) \\ &= \left(b^{1-p}(1 - b) - a^{1-p}(1 - a)\right) a^{p-1} \\ &= (2 - p)(b - a)\,\nu^{-p} \left(\frac{1-p}{2-p} - \nu\right) a^{p-1}, \end{aligned} \]

where the last equation holds by the mean value theorem for some $\nu \in (\min\{a, b\}, \max\{a, b\})$. Since 
$w < \frac{1-p}{2-p}$ we have $r_1 := \min\left\{2^{1/p} w, \frac{1-p}{2-p}\right\} > w$ and $r_0 := \left(2w^p - r_1^p\right)^{1/p} < w$. With straightforward algebra 
one can show that if either $a$ or $b$ belongs to the interval $[r_0, r_1]$, then so does the other one. By varying $a$ 
in $[r_0, r_1]$ we always have $\nu < r_1 \le \frac{1-p}{2-p}$, therefore as $a$ increases in this interval the sign of $\psi'$ changes at 
$a = w$ from positive to negative. Thus, $a = b = w$ is a local maximum of $\psi$, which is a contradiction. ■ 
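
The local-maximum claim at the end of the proof of Lemma A.5 can be illustrated numerically. The sketch below assumes the illustrative values $p = 0.5$ and $w = 0.2$ (so that $w < \frac{1-p}{2-p} = \frac{1}{3}$), evaluates $\psi$ along the constraint $a^p + b^p = 2w^p$ on a grid over $[r_0, r_1]$, and compares the result with $\psi$ at $a = b = w$. The function names `psi_on_constraint` and `interval` are hypothetical.

```python
import numpy as np

def psi_on_constraint(a, w, p):
    """psi(a) = 0.5*((1-a)^2 + (1-b)^2) with b chosen so that a^p + b^p = 2*w^p."""
    b = (2.0 * w**p - a**p) ** (1.0 / p)
    return 0.5 * ((1.0 - a) ** 2 + (1.0 - b) ** 2)

def interval(w, p):
    """Endpoints r0 < w < r1 from the proof, assuming 0 < p < 1 and w < (1-p)/(2-p)."""
    r1 = min(2.0 ** (1.0 / p) * w, (1.0 - p) / (2.0 - p))
    r0 = (2.0 * w**p - r1**p) ** (1.0 / p)
    return r0, r1

p, w = 0.5, 0.2                       # illustrative values only
r0, r1 = interval(w, p)
grid = np.linspace(r0, r1, 201)       # candidate values of a in [r0, r1]
values = psi_on_constraint(grid, w, p)
print("psi at a = b = w:", psi_on_constraint(w, w, p))
print("largest psi on the grid:", values.max())
```

If the grid values never exceed $\psi(w)$, this is consistent with $a = b = w$ maximizing $\psi$ on $[r_0, r_1]$ rather than minimizing it, which is the contradiction used in the proof.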